name: toc
Table of Contents
class: inverse, center, middle
Pre-bootcamp HW
Talk about your analysis on the
weather data
in groups of 2-3
Let’s get it all out there
When you think of R what comes to mind?
name: whyr
Why are you here wanting to learn R?
Why should you learn R?
- R is free, R improves, R has existed for 40+ years
- Students can show what they have learned to a potential employer easily
- The support system for R is actually much better than some people say
But as a professor, why should you learn R?
- R and R Markdown will save you TONS of time. Long term thinking is key.
- You can easily tweak your code if you need to do another analysis
- Remembering what drop-down menu some thing is in two years from now in a different program will be hard
- Remembering to copy-and-paste your updated plots/analysis into your word processor is a pain and error prone
But as a professor, why should you learn R?
class: inverse, center, middle
DEMO in RStudio
with R Markdown
class: center, middle
What this bootcamp is
- Designed for all levels of knowledge and background
What this bootcamp is not
- A thorough development of machine learning
techniques applied to big data
–
- A discussion on the role of p-values in science
–
- A tutorial on how to work with base R subsetting and base R graphics
What I hope you’ll learn
- That R is not as scary as it used to be.
–
- Learning how to Google is an incredibly valuable skill.
–
- That having students turn in assignments using R scripts and R Markdown provides an efficient way to check their work and their analyses easily.
What I hope you’ll learn
- That the R packages you will use in this workshop are the same ones that are used by scientists and graduate programs all over the world
–
- That teaching students how to use open-source tools is what is best for them long term
–
- That students with no programming background can do great things in only a few weeks
Pieces of advice
- Scaffold & support as a foreign languages do
–
- To be able to use R (and really any other language), students need more than a few assignments and more than a weekly lab
–
- Make use of RStudio Projects (students have a really hard time navigating directory structures)
Pieces of advice
My opinion
- Have students work on writing their code in R script files and documenting their errors
- This is the same workflow that DataCamp uses
–
- Show students R Markdown after a few weeks of working with the script file
- I’ve found it is hard for students to learn R and R Markdown from the start
- Better to have them use R Markdown in groups initially
class: inverse, center, middle
R Data Types
The bare minimum needed for understanding today
Vector/variable - Type of vector (int, num, chr, lgl, date)
–
Data frame - Vectors of (potentially) different types - Each vector has the same number of rows
The bare minimum needed for understanding today
library(tibble)
library(lubridate)
ex1 <- data_frame(
vec1 = c(1980, 1990, 2000, 2010),
vec2 = c(1L, 2L, 3L, 4L),
vec3 = c("low", "low", "high", "high"),
vec4 = c(TRUE, FALSE, FALSE, FALSE),
vec5 = ymd(c("2017-05-23", "1776/07/04", "1983-05/31", "1908/04-01"))
)
ex1
–
class: center, middle
The tidyverse is a collection of R packages that share common philosophies and are designed to work together.

Beginning steps
Frequently the first thing to do when given a dataset is to - identify the observational unit, - specify the variables, - give the types of variables you are presented with, and - check that the data is tidy. (TO COME)
This will help with - choosing the appropriate plot, - summarizing the data, and - understanding which inferences/models can be applied.
class: inverse, center, middle name: viz
Data Visualization

Inspired by Hans Rosling

- What are the variables here?
- What is the observational unit?
- How are the variables mapped to aesthetics?
Grammar of Graphics
.left-column[ Wilkinson (2005) laid out the proposed “Grammar of Graphics”]
.right-column[
]
Grammar of Graphics in R
.left-column[ Wickham implemented the grammar in R in the ggplot2 package]
.right-column[
]
Grammar of Graphics elsewhere
class: center, middle
# What is a statistical graphic?
## A mapping of
data variables
to
aes()thetic attributes
## of geom_etric objects. |
class: inverse, center, middle
Reproducing the plots in ggplot2
1. A scatterplot
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) +
geom_point()
–

Reproducing the plots in ggplot2
2. A scatter plot where the color of the points corresponds to group
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) +
geom_point(mapping = aes(color = D))
–

Reproducing the plots in ggplot2
3. A scatter plot where the size of the points corresponds to C
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B, size = C)) +
geom_point()
–

Reproducing the plots in ggplot2
4. A line graph
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) +
geom_line()
–

Reproducing the plots in ggplot2
5. A line graph where the color of the line corresponds to D with points added that are all blue of size 4.
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) +
geom_line(mapping = aes(color = D)) +
geom_point(color = "forestgreen", size = 4)
–

The Five-Named Graphs
The 5NG of data viz
- Scatterplot:
geom_point()
Line graph: geom_line()
- Histogram:
geom_histogram()
- Boxplot:
geom_boxplot()
- Bar graph:
geom_bar()
class: center, middle
More ggplot2 examples
Histogram
library(nycflights13)
ggplot(data = weather, mapping = aes(x = humid)) +
geom_histogram(bins = 20, color = "black", fill = "darkorange")

Boxplot (broken)
library(nycflights13)
ggplot(data = weather, mapping = aes(x = month, y = humid)) +
geom_boxplot()

Boxplot (almost fixed)
library(nycflights13)
ggplot(data = weather, mapping = aes(x = month, group = month, y = humid)) +
geom_boxplot()
Boxplot (fixed)
library(nycflights13)
ggplot(data = weather, mapping = aes(x = month, group = month, y = humid)) +
geom_boxplot() +
scale_x_continuous(breaks = 1:12)

Bar graph
library(fivethirtyeight)
ggplot(data = bechdel, mapping = aes(x = clean_test)) +
geom_bar()

How about over time?
library(dplyr)
year_bins <- c("'70-'74", "'75-'79", "'80-'84", "'85-'89",
"'90-'94", "'95-'99", "'00-'04", "'05-'09",
"'10-'13")
bechdel <- bechdel %>%
mutate(five_year = cut(year,
breaks = seq(1969, 2014, 5),
labels = year_bins)) %>%
mutate(clean_test = factor(clean_test,
levels = c("nowomen", "notalk", "men",
"dubious", "ok")))
How about over time? (Stacked)
library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
mapping = aes(x = five_year, fill = clean_test)) +
geom_bar()

How about over time? (Side-by-side)
library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
mapping = aes(x = five_year, fill = clean_test)) +
geom_bar(position = "dodge")

How about over time? (Stacked proportional)
library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
mapping = aes(x = five_year, fill = clean_test)) +
geom_bar(position = "fill", color = "black")

class: center, middle
The tidyverse/ggplot2 is for beginners and
for data science professionals!

Practice
Produce appropriate 5NG with R package & data set in [ ], e.g., [nycflights13 \(\rightarrow\) weather]
Does age predict recline_rude?
[fivethirtyeight \(\rightarrow\) na.omit(flying)]
Distribution of age by sex
[okcupiddata \(\rightarrow\) profiles]
Does budget predict rating?
[ggplot2movies \(\rightarrow\) movies]
Distribution of log base 10 scale of budget_2013
[fivethirtyeight \(\rightarrow\) bechdel]
HINTS

class: inverse, center, middle
DEMO of ggplot2 in RStudio
class: center, middle
Determining the appropriate plot

class: inverse, center, middle name: wrangle
Day 2
Data Wrangling
gapminder data frame in the gapminder package
library(gapminder)
gapminder
Base R versus the tidyverse
Say we wanted mean life expectancy across all years for Asia
–
# Base R
asia <- gapminder[gapminder$continent == "Asia", ]
mean(asia$lifeExp)
[1] 60.0649
–
library(dplyr)
gapminder %>% filter(continent == "Asia") %>%
summarize(mean_exp = mean(lifeExp))
The pipe %>%

A way to chain together commands
- It is essentially the
dplyr equivalent to the
+ in ggplot2
## The 5NG of data viz
geom_point()
geom_line()
geom_histogram()
geom_boxplot()
geom_bar()
The Five Main Verbs (5MV) of data wrangling
filter()
summarize()
group_by()
mutate()
arrange()
filter()
library(gapminder); library(dplyr)
gap_2007 <- gapminder %>% filter(year == 2007)
head(gap_2007)
- Use
== to compare a variable to a value
Logical operators
Use | to check for any in multiple filters being true:
gapminder %>%
filter(year == 2002 | continent == "Europe")
–
Logical operators
Use & or , to check for all of multiple filters being true:
gapminder %>%
filter(year == 2002, continent == "Europe")
Logical operators
Use %in% to check for any being true
(shortcut to using | repeatedly with ==)
gapminder %>%
filter(country %in% c("Argentina", "Belgium", "Mexico"),
year %in% c(1987, 1992))
–
summarize()
- Any numerical summary that you want to apply to a column of a data frame is specified within
summarize().
max_exp_1997 <- gapminder %>%
filter(year == 1997) %>%
summarize(max_exp = max(lifeExp))
max_exp_1997
–
Combining summarize() with group_by()
When you’d like to determine a numerical summary for all levels of a different categorical variable
max_exp_1997_by_cont <- gapminder %>%
filter(year == 1997) %>%
group_by(continent) %>%
summarize(max_exp = max(lifeExp))
max_exp_1997_by_cont
–
Without the %>%
It’s hard to appreciate the %>% without seeing what the code would look like without it:
max_exp_1997_by_cont <-
summarize(
group_by(
filter(
gapminder,
year == 1997),
continent),
max_exp = max(lifeExp))
max_exp_1997_by_cont
ggplot2 revisited
For aggregated data, use geom_col
ggplot(data = max_exp_1997_by_cont,
mapping = aes(x = continent, y = max_exp)) +
geom_col()
The 5MV
filter()
summarize()
group_by()
mutate()
mutate()
- Allows you to
- create a new variable with a specific value OR
- create a new variable based on other variables OR
- change the contents of an existing variable
–
gap_plus <- gapminder %>% mutate(just_one = 1)
head(gap_plus)
mutate()
- Allows you to
- create a new variable with a specific value OR
- create a new variable based on other variables OR
- change the contents of an existing variable
–
gap_w_gdp <- gapminder %>% mutate(gdp = pop * gdpPercap)
head(gap_w_gdp)
mutate()
- Allows you to
- create a new variable with a specific value OR
- create a new variable based on other variables OR
- change the contents of an existing variable
–
gap_weird <- gapminder %>% mutate(pop = pop + 1000)
head(gap_weird)
arrange()
Reorders the rows in a data frame based on the values of one or more variables
gapminder %>%
arrange(year, country)
arrange()
Can also put into descending order
gapminder %>%
filter(year > 2000) %>%
arrange(desc(lifeExp))
Don’t mix up arrange and group_by
Don’t mix up arrange and group_by
This doesn’t really do anything useful
gapminder %>% group_by(year)
Don’t mix up arrange and group_by
But this does
gapminder %>% arrange(year)
Changing of observation unit
True or False > Each of filter, mutate, and arrange change the observational unit.
–
True or False > group_by() %>% summarize() changes the observational unit.
What is meant by “joining data frames” and
why is it useful?
–

Does cost of living in a state relate to whether police officers live in the cities they patrol? What about state political ideology?
library(fivethirtyeight); library(readr); library(dplyr)
police2 <- police_locals %>% select(city, all)
ideology <- read_csv("https://ismayc.github.io/Effective-Data-Storytelling-using-the-tidyverse/datasets/ideology.csv")
police_join <- inner_join(x = police2, y = ideology, by = "city")
police_join
–
url <- paste0("https://ismayc.github.io/",
"Effective-Data-Storytelling-using-the-tidyverse/",
"datasets/cost_of_living.csv")
cost_of_living <- read_csv(url)
police_join_cost <- inner_join(
x = police_join,
y = cost_of_living,
by = "state"
)
police_join_cost %>% select(-state) %>% arrange(city)
Does cost of living in a state relate to whether police officers live in the cities they patrol? What about state political ideology?
ggplot(data = police_join_cost,
mapping = aes(x = index, y = all)) +
geom_point(mapping = aes(color = state_ideology)) +
labs(x = "Cost of Living Index", y = "% Officers Living in City")

Practice (Lay out what the resulting table should look like on paper first.)
Use the 5MV to answer problems from R data packages, e.g., [nycflights13 \(\rightarrow\) weather]
What is the maximum arrival delay for each carrier departing JFK? [nycflights13 \(\rightarrow\) flights]
Determine the top five movies in terms of domestic return on investment for 2013 scaled data
[fivethirtyeight \(\rightarrow\) bechdel]
Include the name of the destination airport as a column in the flights data frame
[nycflights13 \(\rightarrow\) flights, airports]
class: inverse, center, middle
DEMO of dplyr in RStudio
class: inverse, center, middle
Data Importing and Tidying
tidyverse packages
IMPORT
haven for SPSS, SAS, and Stata data files
jsonlite for JSON files
readxl for XLS and XLSX files
readr for CSV and TSV files (and R specific RDS files)
TIDYING
tidyr for converting “messy” into “tidy” data frames
name: import
Basics of Importing
We will begin by downloading and importing a variety of different “messy” data sets. You can download all of them in a ZIP file at http://bit.ly/reed-messy-data. The links below go to the original sources. I’ve converted these original sources into different formats.
class: center, middle, inverse
Demonstration in RStudio
Practice
- Download the four other files and import them into R using the appropriate package
- You may need to check out the help pages for the different packages
- Give them the following names:
who_df, county_pop_df, edu_county_df, and potus12_df
name: tidy class: inverse, middle, center
Tidy Data

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
The third point means we don’t mix apples and oranges.
What is Tidy Data?
- Each observation forms a row. In other words, each row corresponds to a single instance of an observational unit
- Each variable forms a column:
- Some variables may be used to identify the observational units.
- For organizational purposes, it’s generally better to put these in the left-hand columns
- Each type of observational unit forms a table with the entries in the table corresponding to values.
Differentiating between neat data and tidy data
- Colloquially, they mean the same thing
- But in our context, one is a subset of the other.
Neat data is - easy to look at, - organized nicely, and - in table form.
–
Tidy data is neat but also abides by a set of three rules.
class: center, middle


Is this tidy?
|
year
|
title
|
clean_test
|
budget_2013
|
|
1995
|
Apollo 13
|
ok
|
99370665
|
|
2005
|
Brokeback Mountain
|
notalk
|
16583160
|
|
2010
|
Diary of a Wimpy Kid
|
ok
|
16023478
|
|
1984
|
Dune
|
dubious
|
100864980
|
|
1984
|
Ghostbusters
|
notalk
|
67243320
|
|
2003
|
How to Lose a Guy in 10 Days
|
men
|
63304348
|
|
2011
|
Iris
|
ok
|
5696299
|
|
2004
|
Sideways
|
ok
|
20964279
|
|
2000
|
Songcatcher
|
ok
|
2435235
|
|
2004
|
Team America: World Police
|
men
|
24663858
|
|
2010
|
Tron Legacy
|
notalk
|
213646368
|
|
2011
|
War Horse
|
notalk
|
72498355
|
name: demscore
How about this? Is this tidy?
|
country
|
1952
|
1957
|
1962
|
1967
|
1972
|
1977
|
1982
|
1987
|
1992
|
1997
|
2002
|
2007
|
|
Albania
|
-9
|
-9
|
-9
|
-9
|
-9
|
-9
|
-9
|
-9
|
5
|
5
|
7
|
9
|
|
Argentina
|
-9
|
-1
|
-1
|
-9
|
-9
|
-9
|
-8
|
8
|
7
|
7
|
8
|
8
|
|
Armenia
|
-9
|
-7
|
-7
|
-7
|
-7
|
-7
|
-7
|
-7
|
7
|
-6
|
5
|
5
|
|
Australia
|
10
|
10
|
10
|
10
|
10
|
10
|
10
|
10
|
10
|
10
|
10
|
10
|
|
Austria
|
10
|
10
|
10
|
10
|
10
|
10
|
10
|
10
|
10
|
10
|
10
|
10
|
|
Azerbaijan
|
-9
|
-7
|
-7
|
-7
|
-7
|
-7
|
-7
|
-7
|
1
|
-6
|
-7
|
-7
|
|
Belarus
|
-9
|
-7
|
-7
|
-7
|
-7
|
-7
|
-7
|
-7
|
7
|
-7
|
-7
|
-7
|
|
Belgium
|
10
|
10
|
10
|
10
|
10
|
10
|
10
|
10
|
10
|
10
|
10
|
8
|
|
Bhutan
|
-10
|
-10
|
-10
|
-10
|
-10
|
-10
|
-10
|
-10
|
-10
|
-10
|
-10
|
-6
|
|
Bolivia
|
-4
|
-3
|
-3
|
-4
|
-7
|
-7
|
8
|
9
|
9
|
9
|
9
|
8
|
|
Brazil
|
5
|
5
|
5
|
-9
|
-9
|
-4
|
-3
|
7
|
8
|
8
|
8
|
8
|
|
Bulgaria
|
-7
|
-7
|
-7
|
-7
|
-7
|
-7
|
-7
|
-7
|
8
|
8
|
9
|
9
|
name: whytidy
Why is tidy data important?
Think about trying to plot democracy score across years in the simplest way possible.
- It would be much easier if the data looked like what follows instead so we could put
year on the x-axis and
dem_score on the y-axis.
Tidy is good
|
country
|
year
|
dem_score
|
|
Argentina
|
1962
|
-1
|
|
Armenia
|
1997
|
-6
|
|
Denmark
|
1962
|
10
|
|
Ethiopia
|
2007
|
1
|
|
Finland
|
2007
|
10
|
|
Ireland
|
1992
|
10
|
|
Libya
|
1957
|
-7
|
|
Libya
|
1982
|
-7
|
|
Mexico
|
1977
|
-3
|
|
Mexico
|
1982
|
-3
|
|
Spain
|
1962
|
-7
|
|
Switzerland
|
1982
|
10
|
|
Ukraine
|
1997
|
7
|
Let’s plot it
- Plot the line graph for three countries using
ggplot
dem_score4 <- dem_score_tidy %>%
filter(country %in% c("Pakistan", "Portugal", "Uruguay"))
ggplot(data = dem_score4, mapping = aes(x = year, y = dem_score)) +
geom_line(mapping = aes(color = country))

Tidying
The Life Expectancy by year data
library(readr)
life_exp_df <- read_csv("data/le_mess.csv")
#View(life_exp_df)

Doing the “tidying”/reshaping
library(tidyr)
library(dplyr)
(life_exp_tidy <- life_exp_df %>%
gather(key = "year", value = "life_exp", -country))
Tidying
World Health Organization TB data (Stata DTA)
library(haven)
who_df <- read_dta("data/who.dta")
#View(who_df)

WHO ugly…
- This data set contains strange variable names that will require us to look up their meaning in the corresponding data dictionary.
–
- What do we know?
country, iso2, and iso3 are redundant and identify the country
year is a variable and it looks like it corresponds to each country
- But what in the world is
new_sp_m014? And the other columns?…
First step
Like before, we can gather these non-country and non-year variables together:
who_tidy <- who_df %>%
gather(new_sp_m014:newrel_f65, key = "key", value = "value")
We can now see what this has done to the data frame:
Data dictionary saves us some…
- The first three letters of entries in the
key column correspond to new or old cases of TB.
- The next two letters (after the
_) correspond to TB type:
rel for relapse, ep for extrapulmonary TB
sn for smear negative, sp for smear positive
- The next letter after the second
_ corresponds to the sex of the TB patient.
- The remaining numbers correspond to age group:
014 for 0 to 14 years
- …
65 for 65 or older
Looking over the entries of key in who_tidy, do you see anything else that doesn’t match the pattern?
It may be easier to remember that the entries of key correspond to column names in who_df:
[1] "country" "iso2" "iso3" "year"
[5] "new_sp_m014" "new_sp_m1524" "new_sp_m2534" "new_sp_m3544"
[9] "new_sp_m4554" "new_sp_m5564" "new_sp_m65" "new_sp_f014"
[13] "new_sp_f1524" "new_sp_f2534" "new_sp_f3544" "new_sp_f4554"
[17] "new_sp_f5564" "new_sp_f65" "new_sn_m014" "new_sn_m1524"
[21] "new_sn_m2534" "new_sn_m3544" "new_sn_m4554" "new_sn_m5564"
[25] "new_sn_m65" "new_sn_f014" "new_sn_f1524" "new_sn_f2534"
[29] "new_sn_f3544" "new_sn_f4554" "new_sn_f5564" "new_sn_f65"
[33] "new_ep_m014" "new_ep_m1524" "new_ep_m2534" "new_ep_m3544"
[37] "new_ep_m4554" "new_ep_m5564" "new_ep_m65" "new_ep_f014"
[41] "new_ep_f1524" "new_ep_f2534" "new_ep_f3544" "new_ep_f4554"
[45] "new_ep_f5564" "new_ep_f65" "newrel_m014" "newrel_m1524"
[49] "newrel_m2534" "newrel_m3544" "newrel_m4554" "newrel_m5564"
[53] "newrel_m65" "newrel_f014" "newrel_f1524" "newrel_f2534"
[57] "newrel_f3544" "newrel_f4554" "newrel_f5564" "newrel_f65"
A fix using stringr
You can replace all of the entries in key that contained newrel with new_rel via
library(stringr)
library(dplyr)
who_tidy <- who_tidy %>%
mutate(key = str_replace(
string = key,
pattern = "newrel",
replacement = "new_rel")
)
Pulling apart variables
The entry new_sp_m014 is actually four different variables. Use the separate function to pull them apart:
who_tidy <- who_tidy %>%
separate(col = key, into = c("new", "type", "sexage"), sep = "_")
who_tidy
One more pull apart
- We need to pull
sexage into two different variables.
- If you use
?separate, you’ll see that the following is an option:
(who_tidy <- who_tidy %>%
separate(col = sexage, into = c("sex", "age"), sep = 1))
Final cleaning
iso2 and iso3 are redundant if we have country
new is constant since this only has new cases of TB
- Let’s just ignore missing values here
- We also know that
value is actually cases so we can rename that column.
who_tidy <- who_tidy %>% select(-iso2, -iso3, -new) %>%
na.omit() %>% rename("cases" = value)
head(who_tidy, 5)
The power of the %>% (pipe)
Everything we did above can be chained into one sequence:
who_tidy <- who_df %>%
gather(new_sp_m014:newrel_f65, key = "key", value = "value") %>%
mutate(key = str_replace(key, pattern = "newrel",
replacement = "new_rel")) %>%
separate(col = key, into = c("new", "type", "sexage"), sep = "_") %>%
separate(col = sexage, into = c("sex", "age"), sep = 1) %>%
select(-iso2, -iso3, -new) %>%
na.omit() %>%
rename("cases" = value)
BONUS
A Gapminder tidy dataset read in from a Google Sheet!
- I’ve updated the
gapminder data set available in the gapminder R data package by Jenny Bryan here. Jenny provides instructions for recreating the data.
BONUS
- You can download the updated data from Google Sheets by running the following in R:
# Link is https://docs.google.com/spreadsheets/d/18L5ZiXd1CQ97XWSqb04x1YQ4ncsEOdR7eHzgX0ZIuPA/edit?usp=sharing
library(googlesheets)
updated_gap <- gs_key("18L5ZiXd1CQ97XWSqb04x1YQ4ncsEOdR7eHzgX0ZIuPA") %>%
gs_read()
- A script going through the steps to take the “messy” original Gapminder.org files and turn them into this tidy dataset is available here.
Practice (after the bootcamp)
Try to tidy the other data sets you downloaded and imported or other data sets you have!
name: model class: inverse, center, middle
Data Modeling
Modeling
- Experience with
ggplot2 package and knowledge of the Grammar of Graphics primes students for modeling
- Use of the
broom package to unpack regression output into a tidy format
1. ggplot2 Primes Regression
Example:
- All Alaskan Airlines and Frontier flights leaving NYC in 2013
- We want to study the relationship between temperature and departure delay
- For summer (June, July, August) and non-summer months separately
Involves four variables:
carrier, temp, dep_delay, summer
1. ggplot2 Primes Regression
flights_subset <- flights %>%
filter(carrier == "AS" | carrier == "F9") %>%
left_join(weather, by=c("year", "month", "day", "hour", "origin")) %>%
filter(dep_delay < 250) %>%
mutate(summer = ifelse(month == 6 | month == 7 | month == 8, "Summer Flights", "Non-Summer Flights"))
ggplot(data = flights_subset,
mapping = aes(x = temp, y = dep_delay, color = carrier)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ summer)

1. ggplot2 Primes Regression
Why? Dig deeper into data. Look at origin and dest variables as well:
flights_subset %>%
group_by(carrier, origin, dest) %>%
summarise(`Number of Flights` = n())
2. broom Package
- The
broom package takes the messy output of built-in modeling functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.
- Fits in with
tidyverse ecosystem
- This works for many R data types!
2. broom Package
In our case, broom functions take lm objects as inputs and return the following in tidy format!
tidy(): regression output table
augment(): point-by-point values (fitted values, residuals, predicted values)
glance(): scalar summaries like \(R^2\) ,
2. broom Package
Example
library(tidyverse)
library(nycflights13)
library(knitr)
library(broom)
set.seed(2017)
# Load Alaska data, deleting rows that have missing departure delay
# or arrival delay data
alaska_flights <- flights %>%
filter(carrier == "AS") %>%
filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%
sample_n(50)
# Exploratory Data Analysis-----------------------------------------
# Plot of sample of points:
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point()

# Correlation coefficient:
alaska_flights %>% summarize(correl = cor(dep_delay, arr_delay))
# Add regression line
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red")

# Fit Regression and Study Output with broom Package-------------------
# Fit regression
options(scipen = 6) # Set scientific notation
delay_fit <- lm(formula = arr_delay ~ dep_delay, data = alaska_flights)
# 1. broom::tidy() regression table with confidence intervals
# and no p-value stars
regression_table <- delay_fit %>%
tidy(conf.int = TRUE)
regression_table
# 2. broom::augment() for point-by-point values
regression_points <- delay_fit %>%
augment() %>%
select(arr_delay, dep_delay, .fitted, .resid)
regression_points
# and for prediction
new_flights <- data_frame(dep_delay = c(25, 30, 15))
delay_fit %>%
augment(newdata = new_flights)
# 3. broom::glance() scalar summaries of regression
regression_summaries <- delay_fit %>%
glance()
regression_summaries
# Residual Analysis------------------------------------------------------
ggplot(data = regression_points, mapping = aes(x = .resid)) +
geom_histogram(binwidth=10, color = "white") +
geom_vline(xintercept = 0, color = "blue")

ggplot(data = regression_points, mapping = aes(x = .fitted, y = .resid)) +
geom_point() +
geom_abline(intercept = 0, slope = 0, color = "blue")

ggplot(data = regression_points, mapping = aes(sample = .resid)) +
stat_qq()

class: center, middle, inverse name: rmarkdown
Data Communicating
What is Markdown?
- A “plaintext formatting syntax”
- Type in plain text, render to more complex formats
- One step beyond writing a
txt file
- Render to HTML, PDF, DOCX, etc. using Pandoc
What does it look like?
.left-column[
# Header 1
## Header 2
Normal paragraphs of text go here.
**I'm bold**
[links!](http://rstudio.com)
* Unordered
* Lists
And Tables
---- -------
Like This
]
.right-column[
]
What is R Markdown?
- “Literate programming”
- Embed R code in a Markdown document
- Renders textual output along with graphics
.left-column[
```{r chunk1}
library(ggplot2)
library(nycflights13)
pdx_flights <- flights %>%
filter(dest == "PDX", month == 5)
nrow(pdx_flights)
```
```{r chunk2}
ggplot(data = pdx_flights,
mapping = aes(x = arr_delay,
y = dep_delay)) +
geom_point()
```
]
.right-column[
[1] 88
]
Practice
Turn a statistical analysis you have conducted into an R Markdown document.
class: middle
---
title: Putting the [`R`]() in [`R`]()eed and <br> in Lewis and Cla[`R`]()k
author: "Chester Ismay <br><br> GitHub: [`ismayc`](https://github.com/ismayc) <br> Twitter:  [`@old_man_chester`](https://twitter.com/old_man_chester)"
date: 2017-05-25 & 2017-05-26 <br><br> Bootcamp website at <http://bit.ly/rbootcamp17> <br> Slides available at <http://bit.ly/rbootcamp17-slides>
output: 
    html_document:
      toc: true
      toc_float: true
      toc_depth: 1
      theme: cosmo
      highlight: tango
      df_print: paged
      code_download: true
---

name: toc

# Table of Contents

- [Data Viz](#viz)
- [Data Wrangling](#wrangle)
- [Data Importing](#import)
- [Data Tidying](#tidy)
- [Data Modeling](#model)
- [Data Communicating](#rmarkdown)


---

class: inverse, center, middle

<!-- Write slide link on whiteboard -->

```{r include=FALSE}
#options(servr.daemon = TRUE)
library(tidyverse)
library(mosaic)
library(stringr)
library(okcupiddata)
library(knitr)
filter <- dplyr::filter
knitr::opts_chunk$set(warning=FALSE, message=FALSE, fig.width=9.5, fig.height=4.5, 
  comment=NA, rows.print=16)
theme_set(theme_gray(base_size = 24))
```

# Pre-bootcamp HW

## Talk about your analysis on the <br> `weather` data <br> in groups of 2-3

---

## Let's get it all out there

## When you think of R what comes to mind?

---

name: whyr

## Why are you here wanting to learn R?

- ~~Because you enjoy partaking in [Schadenfreude](https://en.wikipedia.org/wiki/Schadenfreude)~~

<center>
<img src="figure/google_scholar.png" style="width: 700px;"/>
</center>

---

## Why should you learn R?

- R is free, R improves, R has existed for 40+ years
- Students can show what they have learned to a potential employer easily
- The support system for R is actually much better than some people say


---

## But as a professor, why should you learn R?

- R and R Markdown will save you TONS of time. Long term thinking is key.
    - You can easily tweak your code if you need to do another analysis
    - Remembering what drop-down menu some thing is in two years from now in a different program will be hard
    - Remembering to copy-and-paste your updated plots/analysis into your word processor is a pain and error prone

---

## But as a professor, why should you learn R?
    
<center>
<img src="figure/tweet.jpg" style="width: 700px;"/>
</center>

---

class: inverse, center, middle

# DEMO in RStudio <br> with R Markdown

---

class: center, middle

## Analogy

R            |  RStudio |  DataCamp
:-------------------------:|:-------------------------:|:-------------------------:
<img src="figure/engine.jpg" alt="Drawing" style="width: 350px;"/>  |  <img src="figure/dashboard.jpg" alt="Drawing" style="width: 350px;"/>  |  <img src="figure/instructor.jpg" alt="Drawing" style="width: 350px;"/>

---

## What this bootcamp is

![](https://ourcodingclub.github.io/img/tidyverse.png)

- Designed for all levels of knowledge and background

---

## What this bootcamp is not

- A thorough development of machine learning <br> techniques applied to big data

--

- A discussion on the role of _p_-values in science

--

- A tutorial on how to work with base R subsetting and base R graphics 

---

## What I hope you'll learn

- That R is not as scary as it used to be.  

--

- Learning how to Google is an incredibly valuable skill.

--

- That having students turn in assignments using R scripts and R Markdown provides an efficient way to check their work and their analyses easily.

---

## What I hope you'll learn

- That the R packages you will use in this workshop are the same ones that are used by scientists and graduate programs all over the world

--

- That teaching students how to use open-source tools is what is best for them long term

--

- That students with no programming background can do great things in only a few weeks
    - [Sociology/Criminal Justice majors at Pacific U.](https://ismayc.github.io/soc301_s2017/group-projects/index.html)
    - [A wide range of students at Middlebury College](https://rudeboybert.github.io/MATH116/PS/final_project/final_project_outline.html)

---

## Pieces of advice

<center>
<a href="http://giphy.com/gifs/pool-diving-synchronized-swimming-pDWtwK7D2IlFu"><img src="figure/giphy2.gif" style="width: 300px;"/></a>
</center>

- Scaffold & support as a foreign languages do

--

- To be able to use R (and really any other language), students need more than a few assignments and more than a weekly lab

--

- Make use of RStudio Projects (students have a really hard time navigating directory structures)


---

## Pieces of advice

### My opinion

- Have students work on writing their code in R script files and documenting their errors
    - This is the same workflow that DataCamp uses

--

- Show students [R Markdown after a few weeks](https://ismayc.github.io/soc301_s2017/schedule/) of working with the script file
    - I've found it is hard for students to learn R and R Markdown from the start
    - Better to have them use R Markdown in groups initially


---


class: inverse, center, middle

# R Data Types

---

## The bare minimum needed for understanding today

Vector/variable
  - Type of vector (`int`, `num`, `chr`, `lgl`, `date`)

--

Data frame
  - Vectors of (potentially) different types
  - Each vector has the same number of rows

---

## The bare minimum needed for understanding today

```{r eval=FALSE}
library(tibble)
library(lubridate)
ex1 <- data_frame(
    vec1 = c(1980, 1990, 2000, 2010),
    vec2 = c(1L, 2L, 3L, 4L),
    vec3 = c("low", "low", "high", "high"),
    vec4 = c(TRUE, FALSE, FALSE, FALSE),
    vec5 = ymd(c("2017-05-23", "1776/07/04", "1983-05/31", "1908/04-01"))
  )
ex1
```

--

```{r echo=FALSE}
library(tibble)
library(lubridate)
ex1 <- data_frame(
    vec1 = c(1980, 1990, 2000, 2010),
    vec2 = c(1L, 2L, 3L, 4L),
    vec3 = c("low", "low", "high", "high"),
    vec4 = c(TRUE, FALSE, FALSE, FALSE),
    vec5 = ymd(c("2017-05-23", "1776/7/04", "1983-5/31", "1908/04-1"))
  )
ex1
```
  
---

class: center, middle  
  
## Welcome to the [tidyverse](https://blog.rstudio.org/2016/09/15/tidyverse-1-0-0/)!
  
The `tidyverse` is a collection of R packages that share common philosophies and are designed to work together. <br><br> 
  
<a href="http://tidyverse.tidyverse.org/logo.png"><img src="figure/tidyverse.png" style="width: 200px;"/><a>

---


## Beginning steps

Frequently the first thing to do when given a dataset is to
- identify the observational unit,
- specify the variables,
- give the types of variables you are presented with, and
- check that the data is <u>tidy</u>. (TO COME)

This will help with 
- choosing the appropriate plot, 
- summarizing the data, and 
- understanding which inferences/models can be applied.

---

class: inverse, center, middle
name: viz


# Data Visualization

<a href="http://gitsense.github.io/images/wealth.gif"><img src="figure/wealth.gif" style="width: 770px;"/></a>

Inspired by [Hans Rosling](https://www.youtube.com/watch?v=jbkSRLYSojo)

---

```{r echo=FALSE,fig.height=6, fig.width=10, fig.align='center'}
library(gapminder)
options(scipen = 99)
#gap_with_colors <-
#  data.frame(gapminder,
#             cc = I(country_colors[match(gapminder$country,
#                                         names(country_colors))]))

gapminder %>% filter(year == 1992) %>%
  ggplot(aes(x = log(gdpPercap, base = 10), y = lifeExp, color = continent,
             size = pop)) +
  geom_point() + xlab('Gross Domestic Product (log scale)') + ylab('Life Expectancy at birth (years)') + ggtitle("Gapminder for 1992")

#+
#  scale_color_manual(values = gapminder::continent_colors)
```


- What are the variables here?
- What is the observational unit?
- How are the variables mapped to aesthetics?

---


## Grammar of Graphics

.left-column[
Wilkinson (2005) laid out the proposed "Grammar of Graphics"
]

.right-column[
<a href="http://www.powells.com/book/the-grammar-of-graphics-9780387245447"><img src="figure/graphics.jpg" style="width: 300px;"></a>
]

---

## Grammar of Graphics in R

.left-column[
Wickham implemented the grammar in R in the `ggplot2` package
]

.right-column[
<a href="http://www.powells.com/book/ggplot2-elegant-graphics-for-data-analysis-9783319242750/68-428"><img src="figure/ggplot2.jpg" style="width: 300px;"></a>
]

---

## Grammar of Graphics elsewhere

- [Make plots online with plotly](https://plot.ly/)

- [Leland Wilkinson](https://research.tableau.com/user/leland-wilkinson) works at [Tableau](https://www.tableau.com/)

- The Grammar of Graphics provides a theoretical framework for building and deciphering statistical graphics

---

class: center, middle

# What is a statistical graphic?
--

## A `mapping` of <br> `data` variables
--

## to <br> `aes()`thetic attributes

--
## of <br> `geom_`etric objects.

---

class: inverse, center, middle

## Back to basics

---

### Consider the following data in tidy format:

```{r}
simple_ex <-
  data_frame(
    A = c(1980, 1990, 2000, 2010),
    B = c(1, 2, 3, 4),
    C = c(3, 2, 1, 2),
    D = c("low", "low", "high", "high")
  )
simple_ex
```

<!-- Copy to chalkboard/whiteboard -->

---

### Consider the following data in tidy format:

```{r echo=FALSE}
simple_ex %>% knitr::kable(format="html")
```


- Sketch the graphics below on paper, where the `x`-axis is variable `A` and the `y`-axis is variable `B`

1. <small>A scatter plot</small>
1. <small>A scatter plot where the `color` of the points corresponds to `D`</small>
1. <small>A scatter plot where the `size` of the points corresponds to `C`</small>
1. <small>A line graph</small>
1. <small>A line graph where the `color` of the line corresponds to `D` with points added that are all forestgreen of size 4.</small>

---

## Reproducing the plots in <small>`ggplot2`</small>

### 1. A scatterplot

```{r, eval=FALSE}
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_point()
```
--

```{r, echo=FALSE, fig.height=4.5}
ggplot(data = simple_ex, aes(x = A, y = B)) + 
  geom_point()
```


---


## Reproducing the plots in <small>`ggplot2`</small>

### 2. A scatter plot where the `color` of the points corresponds to `group`

```{r, eval=FALSE}
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_point(mapping = aes(color = D))
```
--

```{r, echo=FALSE, fig.height=4.5}
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_point(mapping = aes(color = D))
```


---

## Reproducing the plots in <small>`ggplot2`</small>

### 3. A scatter plot where the `size` of the points corresponds to `C`

```{r, eval=FALSE}
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B, size = C)) + 
  geom_point()
```
--

```{r, echo=FALSE, fig.height=4.5}
ggplot(data = simple_ex, mapping = aes(x = A, y = B, size = C)) + 
  geom_point()
```


---

## Reproducing the plots in <small>`ggplot2`</small>

### 4. A line graph

```{r, eval=FALSE}
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_line()
```
--

```{r, echo=FALSE, fig.height=4.5}
ggplot(data = simple_ex, aes(x = A, y = B)) + 
  geom_line()
```


---

## Reproducing the plots in <small>`ggplot2`</small>

### 5. A line graph where the `color` of the line corresponds to `D` with points added that are all blue of size 4.

```{r, eval=FALSE}
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_line(mapping = aes(color = D)) +
  geom_point(color = "forestgreen", size = 4)
```
--

```{r, echo=FALSE, fig.height=4}
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_line(mapping = aes(color = D)) +
  geom_point(color = "forestgreen", size = 4)
```

---

# The Five-Named Graphs 

## The 5NG of data viz

- Scatterplot: `geom_point()`
- Line graph: `geom_line()`
--

- Histogram: `geom_histogram()`
- Boxplot: `geom_boxplot()`
- Bar graph: `geom_bar()`


---

class: center, middle

## More `ggplot2` examples

---

## Histogram

```{r fig.height=5.5}
library(nycflights13)
ggplot(data = weather, mapping = aes(x = humid)) +
  geom_histogram(bins = 20, color = "black", fill = "darkorange")
```

---

## Boxplot (broken)

```{r fig.height=5.5}
library(nycflights13)
ggplot(data = weather, mapping = aes(x = month, y = humid)) +
  geom_boxplot()
```

---


## Boxplot (almost fixed)

```{r fig.height=5.5}
library(nycflights13)
ggplot(data = weather, mapping = aes(x = month, group = month, y = humid)) +
  geom_boxplot() 
```
---

## Boxplot (fixed)

```{r fig.height=5.5}
library(nycflights13)
ggplot(data = weather, mapping = aes(x = month, group = month, y = humid)) +
  geom_boxplot() +
  scale_x_continuous(breaks = 1:12)
```

---

## Bar graph

```{r}
library(fivethirtyeight)
ggplot(data = bechdel, mapping = aes(x = clean_test)) +
  geom_bar()
```

---

## How about over time?

- Hop into `dplyr`

```{r}
library(dplyr)
year_bins <- c("'70-'74", "'75-'79", "'80-'84", "'85-'89",
               "'90-'94", "'95-'99", "'00-'04", "'05-'09",
               "'10-'13")
bechdel <- bechdel %>%
  mutate(five_year = cut(year, 
                         breaks = seq(1969, 2014, 5), 
                         labels = year_bins)) %>% 
  mutate(clean_test = factor(clean_test, 
                             levels = c("nowomen", "notalk", "men",
                                        "dubious", "ok")))
```

---

## How about over time? (Stacked)

```{r fig.width=11}
library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
       mapping = aes(x = five_year, fill = clean_test)) +
  geom_bar()
```

---

## How about over time? (Side-by-side)

```{r fig.width=11}
library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
       mapping = aes(x = five_year, fill = clean_test)) +
  geom_bar(position = "dodge")
```

---

## How about over time? (Stacked proportional)

```{r fig.width=11}
library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
       mapping = aes(x = five_year, fill = clean_test)) +
  geom_bar(position = "fill", color = "black")
```

---

class: center, middle

## The `tidyverse`/`ggplot2` is for beginners and <br> for data science professionals!

<a href="https://fivethirtyeight.com/features/the-dollar-and-cents-case-against-hollywoods-exclusion-of-women/"><img src="figure/bechdel.png" style="width: 500px;"/><a>

---

## Practice

Produce appropriate 5NG with R package & data set in [ ], e.g., [`nycflights13` $\rightarrow$ `weather`] 

<!--
Try to look through the help documentation/Google to improve your plots
-->

1. Does `age` predict `recline_rude`? <br> [`fivethirtyeight` $\rightarrow$ `na.omit(flying)`]

2. Distribution of `age` by `sex` <br> [`okcupiddata` $\rightarrow$ `profiles`]

3. Does `budget` predict `rating`? <br> [`ggplot2movies` $\rightarrow$ `movies`]

4. Distribution of log base 10 scale of `budget_2013` <br> [`fivethirtyeight` $\rightarrow$ `bechdel`]

---

### HINTS

```{r echo=FALSE, fig.height=7, fig.width=10.5}
library(gridExtra)
library(fivethirtyeight)
library(ggplot2movies)
library(okcupiddata)

p1 <- ggplot(data = na.omit(flying), mapping = aes(fill = recline_rude, x = age)) + geom_bar(position = "fill") + ggtitle("Problem 1") + theme_gray(base_size = 20)

p2 <- ggplot(data = profiles, mapping = aes(x = sex, y = age)) +
  geom_boxplot() + ggtitle("Problem 2") + theme_gray(base_size = 20)

p3 <- ggplot(data = movies, mapping = aes(x = budget, y = rating)) +
  geom_point() + ggtitle("Problem 3") + theme_gray(base_size = 20)

p4 <- ggplot(data = bechdel, mapping = aes(x = log(budget_2013, 10))) +
  geom_histogram(color = "white", bins = 10) + ggtitle("Problem 4") +
  theme_gray(base_size = 20)

grid.arrange(p1, p2, p3, p4, ncol = 2, padding = unit(0.5, "line"),
             widths = c(2.6, 1.4))

```

---

class: inverse, center, middle

# DEMO of `ggplot2` in RStudio

---

class: center, middle

### Determining the appropriate plot

<a href="https://coggle.it/diagram/V_G2gzukTDoQ-aZt"><img src="figure/viz_mindmap.png" style="width: 400px;"/><a>

---

class: inverse, center, middle
name: wrangle

## Day 2

## Data Wrangling

---

### `gapminder` data frame in the `gapminder` package

```{r rows.print=15}
library(gapminder)
gapminder
```

---

## Base R versus the `tidyverse`

Say we wanted mean life expectancy across all years for Asia

--

```{r}
# Base R
asia <- gapminder[gapminder$continent == "Asia", ]
mean(asia$lifeExp)
```
--
 
```{r}
library(dplyr)
gapminder %>% filter(continent == "Asia") %>%
  summarize(mean_exp = mean(lifeExp))
```

---

## The pipe `%>%`

`r knitr::include_graphics("figure/pipe.png")` &emsp; &emsp;`r knitr::include_graphics("figure/MagrittePipe.jpg")`
--

- A way to chain together commands
--

- It is *essentially* the `dplyr` equivalent to the <br> `+` in `ggplot2`

---

## The 5NG of data viz
--

### `geom_point()`<br> `geom_line()` <br> `geom_histogram()`<br>  `geom_boxplot()`<br> `geom_bar()`

---

# The Five Main Verbs (5MV) of data wrangling

### `filter()` <br> `summarize()` <br> `group_by()` <br> `mutate()` <br> `arrange()`

---

## `filter()`

- Select a subset of the rows of a data frame. 

- The arguments are the "filters" that you'd like to apply.
--

```{r}
library(gapminder); library(dplyr)
gap_2007 <- gapminder %>% filter(year == 2007)
head(gap_2007)
```

- Use `==` to compare a variable to a value

---

## Logical operators

- Use `|` to check for any in multiple filters being true:
--

```{r eval=FALSE}
gapminder %>% 
  filter(year == 2002 | continent == "Europe")
```
--

```{r echo=FALSE}
gapminder %>% 
  filter(year == 2002 | continent == "Europe")
```

---

## Logical operators

- Use `&` or `,` to check for all of multiple filters being true:
--

```{r eval=FALSE}
gapminder %>% 
  filter(year == 2002, continent == "Europe")
```

```{r echo=FALSE}
gapminder %>% 
  filter(year == 2002, continent == "Europe")
```

---

## Logical operators

- Use `%in%` to check for any being true <br> (shortcut to using `|` repeatedly with `==`)
--

```{r eval=FALSE}
gapminder %>% 
  filter(country %in% c("Argentina", "Belgium", "Mexico"),
         year %in% c(1987, 1992))
```
--

```{r echo=FALSE}
gapminder %>% 
  filter(country %in% c("Argentina", "Belgium", "Mexico"),
         year %in% c(1987, 1992))
```


---

## `summarize()`

- Any numerical summary that you want to apply to a column of a data frame is specified within `summarize()`.

```{r eval=FALSE}
max_exp_1997 <- gapminder %>% 
  filter(year == 1997) %>% 
  summarize(max_exp = max(lifeExp))
max_exp_1997
```
--

```{r echo=FALSE}
max_exp_1997 <- gapminder %>% 
  filter(year == 1997) %>% 
  summarize(max_exp = max(lifeExp))
max_exp_1997
```



---

### Combining `summarize()` with `group_by()`

When you'd like to determine a numerical summary for all
levels of a different categorical variable

```{r eval=FALSE}
max_exp_1997_by_cont <- gapminder %>% 
  filter(year == 1997) %>% 
  group_by(continent) %>%
  summarize(max_exp = max(lifeExp))
max_exp_1997_by_cont
```

--
```{r echo=FALSE}
max_exp_1997_by_cont <- gapminder %>% 
  filter(year == 1997) %>% 
  group_by(continent) %>%
  summarize(max_exp = max(lifeExp))
max_exp_1997_by_cont
```

---

### Without the `%>%`

It's hard to appreciate the `%>%` without seeing what the code would look like without it:

```{r}
max_exp_1997_by_cont <- 
  summarize(
    group_by(
      filter(
        gapminder, 
          year == 1997), 
      continent),
    max_exp = max(lifeExp))
max_exp_1997_by_cont
```



---

## `ggplot2` revisited

For aggregated data, use `geom_col`

```{r fig.height=4.5}
ggplot(data = max_exp_1997_by_cont, 
       mapping = aes(x = continent, y = max_exp)) +
  geom_col()
```
---


## The 5MV

- `filter()`
- `summarize()`
- `group_by()`
--

- `mutate()`

--
- `arrange()`

---

## `mutate()`

- Allows you to 
    1. <font color="blue">create a new variable with a specific value</font> OR
    2. create a new variable based on other variables OR
    3. change the contents of an existing variable

--

```{r}
gap_plus <- gapminder %>% mutate(just_one = 1)
head(gap_plus)
```

---

## `mutate()`

- Allows you to 
    1. create a new variable with a specific value OR
    2. <font color="blue">create a new variable based on other variables</font> OR
    3. change the contents of an existing variable

--

```{r}
gap_w_gdp <- gapminder %>% mutate(gdp = pop * gdpPercap)
head(gap_w_gdp)
```

---

## `mutate()`

- Allows you to 
    1. create a new variable with a specific value OR
    2. create a new variable based on other variables OR
    3. <font color="blue">change the contents of an existing variable</font>

--

```{r}
gap_weird <- gapminder %>% mutate(pop = pop + 1000)
head(gap_weird)
```

---

## `arrange()`

- Reorders the rows in a data frame based on the values of one or more variables
--

```{r}
gapminder %>%
  arrange(year, country)
```

---

## `arrange()`

- Can also put into descending order
--

```{r desc}
gapminder %>%
  filter(year > 2000) %>%
  arrange(desc(lifeExp))
```

---

## Don't mix up `arrange` and `group_by`

- `group_by` is used (mostly) with `summarize` to calculate summaries over groups

- `arrange` is used for sorting

---

## Don't mix up `arrange` and `group_by`

This doesn't really do anything useful

```{r rows.print=10}
gapminder %>% group_by(year)
```

---

## Don't mix up `arrange` and `group_by`

But this does

```{r rows.print=10}
gapminder %>% arrange(year)
```

---

## Changing of observation unit

True or False
> Each of `filter`, `mutate`, and `arrange` change the observational unit.

--

True or False
> `group_by() %>% summarize()` changes the observational unit.

<!-- 
Draw diagram for average monthly temp aggregated like on rstudio::conf slides
-->

---

## What is meant by "joining data frames" and <br> why is it useful?

--

```{r echo=FALSE, fig.align='center'}
knitr::include_graphics("https://ismayc.github.io/moderndiver-book/images/join-inner.png")
```

---

### Does cost of living in a state relate to whether police officers live in the cities they patrol?  What about state political ideology?


```{r eval=FALSE}
library(fivethirtyeight); library(readr); library(dplyr)
police2 <- police_locals %>% select(city, all)
ideology <- read_csv("https://ismayc.github.io/Effective-Data-Storytelling-using-the-tidyverse/datasets/ideology.csv")
police_join <- inner_join(x = police2, y = ideology, by = "city")
police_join
```

--

```{r echo=FALSE}
library(fivethirtyeight); library(readr); library(dplyr)
police2 <- police_locals %>% select(city, all)
ideology <- read_csv("https://ismayc.github.io/Effective-Data-Storytelling-using-the-tidyverse/datasets/ideology.csv")
police_join <- inner_join(x = police2, y = ideology, by = "city")
police_join
```


---

```{r}
url <- paste0("https://ismayc.github.io/",
              "Effective-Data-Storytelling-using-the-tidyverse/",
              "datasets/cost_of_living.csv")
cost_of_living <- read_csv(url)
police_join_cost <- inner_join(
    x = police_join, 
    y = cost_of_living, 
    by = "state"
  )
police_join_cost %>% select(-state) %>% arrange(city)
```

---

### Does cost of living in a state relate to whether police officers live in the cities they patrol?  What about state political ideology?

```{r fig.width=10}
ggplot(data = police_join_cost,
       mapping = aes(x = index, y = all)) +
  geom_point(mapping = aes(color = state_ideology)) +
  labs(x = "Cost of Living Index", y = "% Officers Living in City")
```

---

## Practice <small><small>(Lay out what the resulting table should look like on paper first.)</small></small>

Use the 5MV to answer problems from R data packages, e.g., [`nycflights13` $\rightarrow$ `weather`] 

1. What is the maximum arrival delay for each carrier departing JFK? [`nycflights13` $\rightarrow$ `flights`]

2. Determine the top five movies in terms of domestic return on investment for 2013 scaled data<br> [`fivethirtyeight` $\rightarrow$ `bechdel`]

3. [Include the name of the destination airport as a column in the `flights` data frame](http://r4ds.had.co.nz/diagrams/relational-nycflights.png) <br> [`nycflights13` $\rightarrow$ `flights`, `airports`]

---


class: inverse, center, middle

# DEMO of `dplyr` in RStudio

---

<!--
## Homework for tomorrow

- Review the Data Import cheatsheet and use the `readr`, `readxl`, and `haven` packages to read in data from CSVs, Excel XLS/XLSX files, STATA/SPSS files

- Use `ggplot2` and `dplyr` to make graphics and produce data wrangling on data sets of interest to you read in from the previous step
-->

class: inverse, center, middle


# Data Importing and Tidying

---

## `tidyverse` packages

**IMPORT**

- `haven` for SPSS, SAS, and Stata data files
- `jsonlite` for JSON files
- `readxl` for XLS and XLSX files
- `readr` for CSV and TSV files (and R specific RDS files)

<br>

**TIDYING**

- `tidyr` for converting "messy" into "tidy" data frames

---

name: import

## Basics of Importing

<small>We will begin by downloading and importing a variety of different "messy" data sets.  You can download all of them in a ZIP file at <http://bit.ly/reed-messy-data>.  The links below go to the original sources.  I've converted these original sources into different formats.</small>


- [Life Expectancy by year for countries since 1951 (CSV)](https://spreadsheets.google.com/pub?key=phAwcNAVuyj2tPLxKvvnNPA&output=xls)

- [World Health Organization TB data (Stata DTA)](http://www.who.int/tb/country/data/download/en/)

- [Annual Estimates of Population by Age, Sex, Race, and Hispanic Origin for Oregon Counties (XLSX)](https://www.census.gov/popest/data/counties/asrh/2011/CC-EST2011-alldata.html)
- [Educational attainment for the U.S., States, and counties, 1970-2014 (JSON)](http://www.ers.usda.gov/data-products/county-level-data-sets/download-data.aspx)
- [County level results for 2012 POTUS election from The Guardian (SPSS SAV)](https://fusiontables.google.com/DataSource?docid=1qcQLqrAIAe3RcEfdWSm_QcXMLmteVg4uSpSs1rM)


---

class: center, middle, inverse

## Demonstration in RStudio

---

## Practice

- Download the four other files and import them into R using the appropriate package
    - You may need to check out the help pages for the different packages
        - [`haven` for SPSS, SAS, and Stata data files](http://haven.tidyverse.org/)
        - [`jsonlite` for JSON files](https://cran.r-project.org/web/packages/jsonlite/vignettes/json-aaquickstart.html)
        - [`readxl` for XLS and XLSX files](https://github.com/hadley/readxl)
        - [`readr` for CSV/TSV files (& R specific RDS files)](http://readr.tidyverse.org/)
    - Give them the following names: `who_df`, `county_pop_df`, `edu_county_df`, and `potus12_df`
  
---

name: tidy
class: inverse, middle, center

# Tidy Data


---

<img src="http://garrettgman.github.io/images/tidy-1.png" style="width: 750px;"/>

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

The third point means we don't mix apples and oranges.

---

## What is Tidy Data?

1. Each observation forms a row. In other words, each row corresponds to a single instance of an <u>observational unit</u>
1. Each variable forms a column:
    + Some variables may be used to identify the <u>observational units</u>. 
    + For organizational purposes, it's generally better to put these in the left-hand columns
1. Each type of observational unit forms a table with the entries in the table corresponding to values.

---

## Differentiating between <u>neat</u> data and <u>tidy</u> data

- Colloquially, they mean the same thing
- But in our context, one is a subset of the other. 

<br>

<u>Neat</u> data is 
  - easy to look at, 
  - organized nicely, and 
  - in table form.

--

<u>Tidy</u> data is neat but also abides by a set of three rules.

---

class: center, middle

<a href="http://stream1.gifsoup.com/view8/20150404/5192859/lebowski-abides-o.gif"><img src="figure/lebowski-abides-o.gif" style="width: 450px;"/></a>


<img src="figure/tidy-1.png" alt="Drawing" style="width: 750px;"/>

---

## Is this tidy?

```{r echo=FALSE, message=FALSE, warning=FALSE}
library(fivethirtyeight)
set.seed(2)
bechdel %>% sample_n(12) %>%
  select(year, title, clean_test, budget_2013) %>%
  arrange(title) %>%
  kable(format = "html")
```


---

name: demscore

## How about this? Is this tidy?

```{r echo=FALSE, message=FALSE, warning=FALSE}
dem_score <- read_csv("data/dem_score.csv")
knitr::kable(dem_score %>% slice(1:12), format = "html")
```

---

name: whytidy

## Why is tidy data important?

- Think about trying to plot democracy score across years in the simplest way possible.
--

- It would be much easier if the data looked like what follows instead so we could put 
    - `year` on the `x`-axis and 
    - `dem_score` on the `y`-axis.

---

## Tidy is good

```{r echo=FALSE}
dem_score_tidy <- dem_score %>% 
  gather(-country, key = "year", value = "dem_score") %>% 
  mutate(year = as.numeric(year)) 
set.seed(2)
dem_score_tidy %>% sample_n(13) %>% arrange(country) %>% 
  knitr::kable(format = "html")
```

---

## Let's plot it

- Plot the line graph for three countries using `ggplot`

```{r}
dem_score4 <- dem_score_tidy %>%
  filter(country %in% c("Pakistan", "Portugal", "Uruguay"))
ggplot(data = dem_score4, mapping = aes(x = year, y = dem_score)) +
  geom_line(mapping = aes(color = country))
```

---

## Tidying

### The Life Expectancy by year data

```{r}
library(readr)
life_exp_df <- read_csv("data/le_mess.csv")
#View(life_exp_df)
```

```{r echo=FALSE}
knitr::include_graphics("figure/le_mess.png")
```

---

## Doing the "tidying"/reshaping

```{r message=FALSE}
library(tidyr)
library(dplyr)
(life_exp_tidy <- life_exp_df %>% 
    gather(key = "year", value = "life_exp", -country))
```

---

## Tidying

#### World Health Organization TB data (Stata DTA)


```{r}
library(haven)
who_df <- read_dta("data/who.dta")
#View(who_df)
```

```{r echo=FALSE}
knitr::include_graphics("figure/who_mess.png")
```

---

## WHO ugly...

- This data set contains strange variable names that will require us to look up their meaning in the corresponding [**data dictionary**](https://extranet.who.int/tme/generateCSV.asp?ds=dictionary).

--

- What do we know?
     - `country`, `iso2`, and `iso3` are redundant and identify the country
     - `year` is a variable and it looks like it corresponds to each country
- But what in the world is `new_sp_m014`? And the other columns?...

---

## First step

Like before, we can `gather` these non-`country` and non-`year` variables together:

```{r}
who_tidy <- who_df %>% 
  gather(new_sp_m014:newrel_f65, key = "key", value = "value")
```

We can now see what this has done to the data frame:

```{r eval=FALSE}
View(who_tidy)
```

---

## Data dictionary saves us some...

1. The first three letters of entries in the `key` column correspond to `new` or `old` cases of TB.
2. The next two letters (after the `_`) correspond to TB type:
    - `rel` for relapse, `ep` for extrapulmonary TB
    - `sn` for smear negative, `sp` for smear positive
3. The next letter after the second `_` corresponds to the sex of the TB patient.
4. The remaining numbers correspond to age group:
    - `014` for 0 to 14 years
    - ...
    - `65` for 65 or older

---

- Looking over the entries of `key` in `who_tidy`, do you see anything else that doesn't match the pattern?

- It may be easier to remember that the entries of `key` correspond to column names in `who_df`:

```{r}
names(who_df)
```

---

## A fix using `stringr`

You can replace all of the entries in `key` that contained `newrel` with `new_rel` via

```{r}
library(stringr)
library(dplyr)
who_tidy <- who_tidy %>% 
  mutate(key = str_replace(
      string = key, 
      pattern = "newrel", 
      replacement = "new_rel")
    )
```

---

## Pulling apart variables

The entry `new_sp_m014` is actually four different variables.  Use the `separate` function to pull them apart:

```{r}
who_tidy <- who_tidy %>% 
  separate(col = key, into = c("new", "type", "sexage"), sep = "_")
who_tidy
```

---

## One more pull apart

- We need to pull `sexage` into two different variables.  
- If you use `?separate`, you'll see that the following is an option:

```{r}
(who_tidy <- who_tidy %>% 
  separate(col = sexage, into = c("sex", "age"), sep = 1))
```

---

## Final cleaning

- `iso2` and `iso3` are redundant if we have `country`
- `new` is constant since this only has new cases of TB
- Let's just ignore missing values here
- We also know that `value` is actually `cases` so we can `rename` that column.

```{r, tidy=FALSE}
who_tidy <- who_tidy %>% select(-iso2, -iso3, -new) %>% 
  na.omit() %>% rename("cases" = value)
head(who_tidy, 5)
```

---

## The power of the `%>%` (pipe)

Everything we did above can be chained into one sequence:


```{r eval=FALSE}
who_tidy <- who_df %>% 
  gather(new_sp_m014:newrel_f65, key = "key", value = "value") %>%  
  mutate(key = str_replace(key, pattern = "newrel", 
                           replacement = "new_rel")) %>% 
  separate(col = key, into = c("new", "type", "sexage"), sep = "_") %>% 
  separate(col = sexage, into = c("sex", "age"), sep = 1) %>% 
  select(-iso2, -iso3, -new) %>% 
  na.omit() %>% 
  rename("cases" = value)
```

---

## BONUS

### A Gapminder tidy dataset read in from a Google Sheet!

- I've updated the `gapminder` data set available in the `gapminder` R data package by Jenny Bryan [here](https://github.com/jennybc/gapminder/).  Jenny provides [instructions](https://github.com/jennybc/gapminder/tree/master/data-raw) for recreating the data.

---

## BONUS

- You can download the updated data from Google Sheets by running the following in R:

```{r eval=FALSE}
# Link is https://docs.google.com/spreadsheets/d/18L5ZiXd1CQ97XWSqb04x1YQ4ncsEOdR7eHzgX0ZIuPA/edit?usp=sharing
library(googlesheets)
updated_gap <- gs_key("18L5ZiXd1CQ97XWSqb04x1YQ4ncsEOdR7eHzgX0ZIuPA") %>%
  gs_read()
```


- A script going through the steps to take the "messy" original Gapminder.org files and turn them into this tidy dataset is available [here](https://raw.githubusercontent.com/ismayc/tidyverse_workshops/master/importing_tidying/data/cleaning.R).

---

## Practice (after the bootcamp)

Try to tidy the other data sets you downloaded and imported or other data sets you have!

---

name: model
class: inverse, center, middle

# Data Modeling


---


## Modeling

1. Experience with `ggplot2` package and knowledge of the Grammar of Graphics primes students for modeling
1. Use of the `broom` package to unpack regression output into a tidy format

---

## 1. `ggplot2` Primes Regression

Example:

* All Alaskan Airlines and Frontier flights leaving NYC in 2013
* We want to study the relationship between temperature and departure delay
* For summer (June, July, August) and non-summer months separately

Involves four variables: 

- `carrier`, `temp`, `dep_delay`, `summer`

---

## 1. `ggplot2` Primes Regression

```{r}
flights_subset <- flights %>% 
  filter(carrier == "AS" | carrier == "F9") %>% 
  left_join(weather, by=c("year", "month", "day", "hour", "origin")) %>% 
  filter(dep_delay < 250) %>% 
  mutate(summer = ifelse(month == 6 | month == 7 | month == 8, "Summer Flights", "Non-Summer Flights"))
```

---

```{r, fig.height=6, fig.width=10.5}
ggplot(data = flights_subset, 
    mapping = aes(x = temp, y = dep_delay, color = carrier)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ summer)
```

---

## 1. `ggplot2` Primes Regression

Why? Dig deeper into data. Look at `origin` and `dest` variables as well:

<br> 

```{r}
flights_subset %>% 
  group_by(carrier, origin, dest) %>% 
  summarise(`Number of Flights` = n())
```

---


## 2. `broom` Package

* The `broom` package takes the messy output of built-in modeling functions in R, such as
`lm`, `nls`, or `t.test`, and turns them into tidy data frames.
* Fits in with `tidyverse` ecosystem
* This works for [many R data types](https://github.com/tidyverse/broom#available-tidiers)!

---

## 2. `broom` Package

In our case, `broom` functions take `lm` objects as inputs and return the following in tidy format!

* `tidy()`: regression output table
* `augment()`: point-by-point values (fitted values, residuals, predicted values)
* `glance()`: scalar summaries like <large> $R^2$ </large>, 

---

## 2. `broom` Package

Example

```{r}
library(tidyverse)
library(nycflights13)
library(knitr)
library(broom)
set.seed(2017)

# Load Alaska data, deleting rows that have missing departure delay
# or arrival delay data
alaska_flights <- flights %>% 
  filter(carrier == "AS") %>% 
  filter(!is.na(dep_delay) & !is.na(arr_delay)) %>% 
  sample_n(50)
```

---

```{r}
# Exploratory Data Analysis-----------------------------------------
# Plot of sample of points:
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
   geom_point()
```

---

```{r}
# Correlation coefficient:
alaska_flights %>% summarize(correl = cor(dep_delay, arr_delay))

# Add regression line
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red")
```

---

```{r}
# Fit Regression and Study Output with broom Package-------------------
# Fit regression
options(scipen = 6) # Set scientific notation
delay_fit <- lm(formula = arr_delay ~ dep_delay, data = alaska_flights)

# 1. broom::tidy() regression table with confidence intervals 
# and no p-value stars
regression_table <- delay_fit %>% 
  tidy(conf.int = TRUE)
regression_table
```

---

```{r}
# 2. broom::augment() for point-by-point values
regression_points <- delay_fit %>% 
  augment() %>% 
  select(arr_delay, dep_delay, .fitted, .resid) 
regression_points
```

---

```{r}
# and for prediction
new_flights <- data_frame(dep_delay = c(25, 30, 15))
delay_fit %>% 
  augment(newdata = new_flights)
```

---

```{r}
# 3. broom::glance() scalar summaries of regression
regression_summaries <- delay_fit %>% 
  glance() 
regression_summaries
```

---

```{r}
# Residual Analysis------------------------------------------------------
ggplot(data = regression_points, mapping = aes(x = .resid)) +
  geom_histogram(binwidth=10, color = "white") +
  geom_vline(xintercept = 0, color = "blue")
```

---

```{r}
ggplot(data = regression_points, mapping = aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 0, color = "blue")
```

---

```{r}
ggplot(data = regression_points, mapping = aes(sample = .resid)) +
  stat_qq()
```

---

class: center, middle, inverse
name: rmarkdown

# Data Communicating


---

## What is Markdown?

 - A "plaintext formatting syntax"
 - Type in plain text, render to more complex formats
 - One step beyond writing a `txt` file
 - Render to HTML, PDF, DOCX, etc. using Pandoc

---

## What does it look like?

.left-column[
```
  # Header 1
  
  ## Header 2
  
  Normal paragraphs of text go here.
  
  **I'm bold**
  
  [links!](http://rstudio.com)
  
   * Unordered
   * Lists   
   
  And  Tables
  ---- -------
  Like This
  
```
]

.right-column[
<img src="figure/markdown.png" alt="markdown" style="width: 270px;"/>
]

---

## What is R Markdown?
  
- "Literate programming"
- Embed R code in a Markdown document
- Renders textual output along with graphics

***

.left-column[
```

```{r chunk1}
library(ggplot2)
library(nycflights13)
pdx_flights <- flights %>% 
  filter(dest == "PDX", month == 5)
nrow(pdx_flights)
```

```{r chunk2}
ggplot(data = pdx_flights,
  mapping = aes(x = arr_delay, 
                y = dep_delay)) +
  geom_point()
```

```
]

.right-column[
```{r, fig.width=4.5, fig.height=4, echo=FALSE}
library(ggplot2)
library(nycflights13)
pdx_flights <- flights %>% 
  filter(dest == "PDX", month == 5)
nrow(pdx_flights)
ggplot(data = pdx_flights,
  mapping = aes(x = arr_delay, 
                y = dep_delay)) +
  geom_point()
```
]

---

## Practice

Turn a statistical analysis you have conducted into an R Markdown document.

- You can also publish online easily to <http://rpubs.com>

---

class: middle

# Thanks for attending!

- Slides created via the R package [xaringan](https://github.com/yihui/xaringan) by Yihui Xie
- Source code for these slides and the webpage at <https://github.com/ismayc/poRtland-bootcamp17>